Sentiment analysis is one of the most fundamental areas of natural language processing. It deals with
characterizing text to determine the intent of its author, which may be appreciation (positive) or
criticism (negative). This paper compares the results obtained by applying classification with
different classifiers, namely K-nearest neighbor and multinomial naive Bayes. These techniques are
used to label a review as carrying either a positive or a negative remark. The collected data form a
polarity-labeled film review dataset, and a comparison with results available in the literature is
made for a thorough evaluation. This paper investigates the influence of the word-level count
vectorizer and term frequency-inverse document frequency (TF-IDF) on film sentiment analysis. We
conclude that the multinomial naive Bayes (MNB) classifier generates more accurate results with the
TF-IDF vectorizer than with CountVectorizer, while the K-nearest-neighbors (KNN) classifier has the
same accuracy with TF-IDF and CountVectorizer.

Keywords: Features selection; Film; Gujarati; Precision; Sentimentality

1. INTRODUCTION

Natural language processing is used to analyze human opinion and emotion; it mostly focuses on
emotion, attitude, and assessment, driven by the increased use of the web [1], [2]. For Indian
languages such as Gujarati, language independence is essential because of poor resources, robustness,
and scalability. In this paper, sentiment analysis is carried out on film reviews available in the
Gujarati language to achieve state-of-the-art performance using different machine learning
techniques. In today's world everyone relies on the web. User sentiment becomes important because
anyone who wants to buy a new product, enroll in a course, or watch a movie will first look up reviews
and decide based on them. Given the increased use of the web, automation is required, so the role of
natural language processing (NLP) becomes crucial. A number of NLP-based applications are available
for text translation, retrieval, and summarization, which help in identifying people's opinions or
feedback, spam detection, fake news identification, and providing digital medical assistance.
Extensive research is currently under way on developing NLP-based applications for Indian languages [3].

Use of the web has grown rapidly, which creates growth and opportunities for the Indian market.
Information is generated at high volume and velocity, and users in India access the web in their
regional languages; almost all information, from product and movie reviews to news and advertisements,
is also available on the web in regional languages. This gives many researchers scope to explore the
field of NLP. Sentiment analysis is one of the important aspects of NLP, so it becomes essential to
develop resources to analyze sentiment in Indian languages.

Indian languages belong to two major families, Indo-Aryan and Dravidian, and 22 official languages
are spoken in India. The Indo-Aryan family comprises languages such as Hindi, Urdu, Bengali, Oriya,
Punjabi, Konkani, Marathi, Nepali, Gujarati, Sindhi, Dogri, Assamese, Sanskrit, and Kashmiri. The
Dravidian family comprises languages such as Telugu, Tamil, Malayalam, and Kannada [4].

Sentiment analysis of movie reviews is relatively recent, and movie reviews in the Gujarati language
are still an open area of research. So far no dataset of movie reviews in Gujarati is available, and
no research has addressed identifying the sentiment of movies from reviews written in Gujarati.
Lexicon-based and machine translation approaches are so far the most used techniques for identifying
sentiment in Gujarati. In the lexicon-based approach, a dictionary is created that records the
polarity of each word, and the sentiment of a sentence is identified by summing the polarities of its
words: if the sum is greater than 1 the sentiment is positive, otherwise it is negative. A recent
paper used lexicon-based sentiment analysis to classify tweets in Gujarati by creating a SentiWord
dictionary for Gujarati words using the IndoWordNet interface [2]. Ample resources are available for
English, and therefore much research exists on sentiment analysis in English. Now that information on
the web is also available in regional languages, we are motivated to perform sentiment analysis on
Gujarati, the sixth most widely spoken language in India. We have prepared a dataset of movie reviews
in Gujarati to perform sentiment analysis, which is a challenging task. Sufficient resources such as
corpora and language taggers are not available, which makes evaluation difficult. We had to create our
own dataset because no standard dataset is available, which required effort, research, and time. We
also had to perform text preprocessing to generate more accurate results. Applying machine learning
techniques without an established dataset for result analysis was challenging, but we overcame these
challenges and managed to produce satisfactory results by applying machine learning techniques with
term frequency-inverse document frequency (TF-IDF) and CountVectorizer (CV) as feature selection
techniques [4].
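
The lexicon-based rule described in this section (sum the word polarities of a sentence and compare the sum with a threshold) can be sketched as follows. This is only an illustration: the lexicon entries are hypothetical transliterated placeholders, not taken from any SentiWordNet resource, and the function name and the `threshold` parameter are assumptions introduced here. The threshold of 1 follows the wording of the text.

```python
from typing import Dict, List

# Hypothetical polarity lexicon with transliterated placeholder words.
# A real system would load polarities from a resource such as a SentiWord dictionary.
LEXICON: Dict[str, int] = {"saras": 2, "maja": 1, "kharab": -2, "kantalo": -1}


def lexicon_sentiment(tokens: List[str], lexicon: Dict[str, int], threshold: int = 1) -> str:
    """Sum the polarity of every token; words missing from the lexicon count as 0."""
    score = sum(lexicon.get(tok, 0) for tok in tokens)
    # Following the text: a sum greater than the threshold is positive, otherwise negative.
    return "positive" if score > threshold else "negative"


if __name__ == "__main__":
    print(lexicon_sentiment(["film", "saras", "maja"], LEXICON))
    print(lexicon_sentiment(["film", "kharab"], LEXICON))
```

A rule of this kind depends entirely on lexicon coverage, which is one reason a learned classifier is attractive for a low-resource language.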

Different feature selection methods, such as N-gram, TF-IDF, count vector, and word embedding, are
available for AI-based classifiers, and the performance of a classifier can be measured with various
performance parameters, for example recall, F-score, accuracy, and precision [5]. Preprocessing is
the first major phase in text classification and includes tasks such as sentence/word tokenization,
stop word removal, and removal of special characters and numbers. The next step is feature selection
[6], [7]. Many methods are available for feature selection, for example bag of words, TF-IDF, count
vectors, and word embeddings, which are based on natural language processing [8], [9]. The last step
is to apply a machine learning algorithm for sentiment classification, for example
K-nearest-neighbors (KNN) or multinomial naive Bayes (MNB). The impact of two word-level features,
the count vectorizer and the TF-IDF vectorizer, is addressed in this paper using two classifiers,
MNB and KNN, to assess sentiment classification accuracy.

In this paper, we propose machine learning based sentiment analysis for movie reviews in the
Gujarati language (MSAGL). As shown in Table 1, extensive research has been done for languages such as
Hindi, but not enough research is available for Gujarati; here we extend that work in several ways
[4]. In step one, a dataset of movie reviews in Gujarati is created by extracting reviews from the
website https://gujarati.webdunia.com/. The dataset is labeled with polarity for analysis: a negative
review is represented by 0 and a positive review by 1. All reviews are stored in a comma-separated
file. In step two, the dataset is pre-processed by removing unwanted characters, words, and noise;
because the reviews are collected from the internet, cleaning the data is important for achieving more
accurate results. In step three, the dataset is tokenized and features are extracted using the TF-IDF
and CV feature selection methods. In step four, the extracted features are fed to two different
classifiers, multinomial naive Bayes and k-nearest neighbor, to perform sentiment analysis. Each model
produces a confusion matrix after training. The confusion matrix shows the positive and negative
reviews that are correctly and wrongly predicted. In the last step, each model is evaluated using
performance parameters such as accuracy, precision, recall, and F-score. A good amount of effort and
time was invested in preparing the dataset of movie reviews in Gujarati (500 reviews) and the stop
word list for Gujarati. Both are contributions of this work and can be made available for future use
for research purposes only.


2. LITERATURE SURVEY 
A significant amount of work has been done on sentiment analysis in the past few years; we focus on
two techniques, machine learning and lexicon based, that are widely used for different Indian
languages, as shown in Table 1 [10]-[21]. For NLP tasks, word vector representations have been used
effectively to address the sentiment analysis problem [8]. Part-of-speech and lexicon features have
been considered alongside artificial intelligence (AI) approaches, in particular support vector
machine (SVM), logistic regression, and naive Bayes [22]. An approach based on dependency parsing has
been used to determine the sentiment of short texts, with the associated challenges mitigated through
the sentiment structure and the sentiment computation rules [23]. The SVM classifier has been used
with preprocessing steps such as stemming and removal of stop words, non-English characters, and
negation to examine their effectiveness on a film review dataset [24]. Sentiment classification
performance can be improved by feature selection; for that purpose, ten different feature selection
techniques were applied with four classifiers in [1]. Sentiment analysis with a naive Bayes (NB)
classifier was performed on collected tweets in Indian languages (Hindi, Bengali, and Tamil) [25].
Tokenization was applied to the collected tweets, followed by feature extraction using SentiWordNet.
The NB classifier was not convincing on a large dataset, whereas it performed better on a smaller
dataset. For code-mixed data, the method in [26] translates the whole text into a single language,
English, and then identifies the polarity of the translated text for sentiment analysis. Translating
code-mixed text into a single language has several limitations, for example achieving conceptual
equivalence and preserving the linguistic and syntactic structure of the source language [27].
Lexicon-based sentiment analysis has been done for the Telugu language using SentiWordNet [28].
Lexicon-based sentiment analysis using SentiWordNet together with a machine learning approach has
been used to label a Telugu sentence as positive or negative [29]. Two different classifiers have been
used to improve performance for English: word-level N-gram features for word vectorization were used
with logistic regression (LR) and NB, which improved performance [30]. For product reviews, a neural
network called the multilayer perceptron has been used for sentiment classification [31].


Sentiment analysis on film review in Gujarati language using machine learning (Parita Shah)
ISSN: 2088-8708


Table 1. Studies related to sentiment analysis in Indian languages

| Author citation | Techniques | Dataset used | Accuracy (%) | Language |
| --- | --- | --- | --- | --- |
| [4] | Synset Replacement Algorithm (GUJ SentiWordNet), WordNet, Bag-of-words (unigram) | Gujarati Tweets | 52.72 | Gujarati |
| [3] | Neural Network | Microblogging | Not Measured | Gujarati-English (Gujlish) |
| [12] | SVM | Hindi tweets | Hindi: 49.68; Bengali: 43.20 | Hindi, Bengali |
| [11] | Multinomial and Bernoulli Naive Bayes, Logistic Regression, SVM, Random Kitchen Sink | Tamil Movie Reviews | SVM: 64.69 (Bigram); MNB: 47.21 (Bigram) | Tamil |
| [17] | Synset Replacement Algorithm (Hindi SentiWordNet) | Tweets, Movie Reviews and Blogs | Not Measured | Hindi |
| [13] | WordNet, Bag-of-words | HindMonoCorp 0.5, IMDB11 Movie Review dataset | 75.53 | Hindi-English |
| [10] | TnT Tagger | Malayalam Movie Reviews | 91.06 | Malayalam |
| [21] | Lexicon based | Hindi Tweets | 73.53 | Hindi |
| [20] | SVM | Gujarati Tweets | Not Measured | Gujarati |
| [14] | Naive Bayes classifier | Movie Reviews | 87.1 | Hindi |
| [15] | CRF for Aspect Extraction and SVM for Classification | Product Reviews | 54.05 | Hindi |
| [18] | Dictionary Based, Naive Bayes and SVM algorithm | Hindi Tweets related to political party in India during election 2016 | 62.1 | Hindi |
| [19] | Lexicon Based, LMC classifier | Hindi speeches delivered by leaders | Not Measured | Hindi |
| [16] | Lexicon Based, SVM, Random Forests | Not Specified | Not Measured | Hindi, Marathi |


An effective semantic and polarity-based information retrieval approach for heterogeneous datasets is
presented in [22]. The context of the input query is identified, and all records that satisfy the
polarity and context of the input query are retrieved from the data source [22]. In [32], syntactic
and semantic probabilities obtained from WordNet similarities are used as concept relation features
to train a naive Bayes classifier designed to learn concept relations. The naive Bayes classifier is
bootstrapped using an augmentation method. Experiments on benchmark datasets produced promising
results, and the effectiveness of the proposed method was demonstrated by comparing its performance
with similar well-performing automatic ontology construction methods [32]. In [33], the polarity of an
article is determined by examining the scores obtained using the valence aware dictionary for
sentiment reasoning (VADER). For tweets or very short texts, document-level sentiment analysis is a
good choice. However, if sentiment around a particular aspect or feature of a brand, product, or
company is to be profiled, then sentence-level or entity-level analysis is considered [33]. A
lexicon-based and code-mixed approach is used in [34] to achieve better results in identifying
sentiment. The experiments in [35] cover sentiment analysis at the paragraph, sentence, and phrase
levels.


3. PROPOSED METHOD 

Figure 1 shows the proposed approach. We first prepared a dataset of film reviews in the Gujarati
language and then pre-processed the collected data. In the next stage, for feature selection, we used
two techniques, TF-IDF and CountVectorizer, and the resulting features were classified using two
different classifiers. Finally, we compared the performance of the classifiers on several performance
measures.


[Figure: five-stage pipeline. 1. Dataset preparation: collect movie reviews in Gujarati language.
2. Pre-processing: stop word removal and tokenization. 3. Feature extraction: TF-IDF and count
vectorizer. 4. Classification: polarity detection using KNN and MNB. 5. Performance evaluation:
using accuracy, precision, recall and F-score.]

Figure 1. Proposed approach


Classification steps followed are: 

a. Stage 1. A dataset labeled with polarity is prepared for analysis; a negative review is
represented by 0 and a positive review by 1. All reviews are stored in a comma-separated file. We
created the dataset of movie reviews in Gujarati by crawling with Python (using the beautifulsoup
library). 500 movie reviews were collected from the Gujarati Webdunia website at
https://gujarati.webdunia.com/ and labelled 0 or 1, where 0 represents negative and 1 represents
positive, as shown in Figure 2.


[Figure: first ten rows (index 0 to 9) of the dataset, with a column "text" holding the truncated
Gujarati review and a column "experience" holding the label; the labels of rows 0 to 9 are
1, 1, 1, 1, 0, 1, 1, 1, 1, 0.]

Figure 2. Movie review dataset prepared in Gujarati language
b. Stage 2. Special characters (such as @ and !) and unnecessary blank space are removed, followed by
the removal of words that carry no sentiment; tokenization is then done in the pre-processing step,
as in the following example. Original sentence = [Gujarati review sentence containing special
characters; text garbled in extraction]. Removing the special characters and unwanted words gives the
output sentence = [the same Gujarati sentence without special characters and unwanted words; text
garbled in extraction].

c. Stage 3. Features are extracted from the cleaned, tokenized data using the TF-IDF and count
vectorizer techniques. Tokenization splits a paragraph into sentences and each sentence into words,
so the cleaned example sentence becomes a list of individual Gujarati word tokens [token list garbled
in extraction].

d. Stage 4. The extracted features are fed to two different classifiers (KNN, MNB). Each model
produces a confusion matrix after training. The confusion matrix shows the positive and negative
reviews that are correctly and wrongly predicted.

e. Stage 5. Each model is evaluated using the performance parameters.
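
The pre-processing and tokenization of stages 2 and 3 can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the stop-word set, the example review, and the names `STOP_WORDS`, `EXAMPLE` and `preprocess` are hypothetical, and whitespace splitting stands in for whatever tokenizer was actually used. Characters are removed with `str.translate` rather than a `\w`-based regular expression, because Gujarati vowel signs are combining marks that such a pattern would split off.

```python
import string
from typing import List

# Hypothetical stop-word list; the stop-word list prepared for the paper is not reproduced here.
STOP_WORDS = {"છે", "આ", "અને"}

# Hypothetical review with extra blanks, special characters and digits.
EXAMPLE = "આ ફિલ્મ   ખૂબ સારી છે!! @2021"

# Translation table that deletes ASCII punctuation (including @ and !) and ASCII digits.
_STRIP = str.maketrans("", "", string.punctuation + string.digits)


def preprocess(review: str) -> List[str]:
    """Remove special characters and digits, tokenize on whitespace, drop stop words."""
    cleaned = review.translate(_STRIP)
    tokens = cleaned.split()  # split() also collapses runs of blank space
    return [tok for tok in tokens if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess(EXAMPLE))
```

The resulting token lists are what a vectorizer would consume in stage 3.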


4. FEATURE SELECTION 
4.1. TF-IDF 

TF-IDF features weigh the importance of a word in each document. Term frequency (TF) is calculated
as the number of times a term appears in a document divided by the number of words in the document.
Inverse document frequency (IDF) likewise measures the importance of the term and is calculated as
the number of documents divided by the number of documents containing the term t [36]; in practice,
the logarithm of this ratio is commonly used. For instance, if a document contains 300 words and a
given term appears 20 times, then TF = 20/300 ≈ 0.067, or about 6.7 when expressed as a percentage.
If there are 7000 documents and only 200 of them contain the term, then IDF = 7000/200 = 35.
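
The worked example in this subsection (a term that appears 20 times in a 300-word document, and 200 of 7000 documents containing the term) can be checked with a few lines of code. The function names and the `use_log` switch are assumptions made for illustration; the plain ratio follows the definition given in the text, while the logarithmic variant is the form more commonly used in practice.

```python
import math


def term_frequency(term_count: int, total_terms: int) -> float:
    """Occurrences of the term divided by the number of words in the document."""
    return term_count / total_terms


def inverse_document_frequency(n_docs: int, n_docs_with_term: int, use_log: bool = False) -> float:
    """Ratio of all documents to documents containing the term, optionally log-scaled."""
    ratio = n_docs / n_docs_with_term
    return math.log(ratio) if use_log else ratio


tf = term_frequency(20, 300)                   # about 0.067
idf = inverse_document_frequency(7000, 200)    # 35.0, the plain ratio used in the text
idf_log = inverse_document_frequency(7000, 200, use_log=True)  # natural log of 35

if __name__ == "__main__":
    print(round(tf, 3), idf, round(tf * idf, 3), round(idf_log, 3))
```

Library implementations usually add smoothing and normalization on top of this, so their values will not match the hand calculation exactly.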


4.2. Count vectorizer 

Text is transformed into a vector by marking the presence (1) or absence (0) of each word in a given
input [36]. The count vectorizer calculation for the following two sentences is shown in Table 2. The
generated matrix contains 2 rows and 3 columns; each row represents the presence or absence of each
feature in a sentence.
Sentence 1 = [Gujarati sentence containing Feature1 and Feature3]
Sentence 2 = [Gujarati sentence containing Feature1, Feature2 and Feature3]


Table 2. CountVectorizer matrix generation

| Sentences | Feature1 | Feature2 | Feature3 |
| --- | --- | --- | --- |
| Sentence 1 | 1 | 0 | 1 |
| Sentence 2 | 1 | 1 | 1 |
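
The presence/absence matrix of Table 2 can be reproduced with a small sketch. The tokens `w1`, `w2` and `w3` are placeholders standing in for the three Gujarati features, and `binary_vectorize` is a hypothetical helper, not a library function. Note that scikit-learn's `CountVectorizer` counts occurrences by default; its `binary=True` option gives the presence/absence encoding described here.

```python
from typing import List, Tuple


def binary_vectorize(sentences: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Build a sorted vocabulary and mark each term as present (1) or absent (0) per sentence."""
    vocabulary = sorted({tok for sent in sentences for tok in sent})
    matrix = [[1 if term in sent else 0 for term in vocabulary] for sent in sentences]
    return vocabulary, matrix


# Placeholder tokens: sentence 1 holds features 1 and 3, sentence 2 holds all three.
SENTENCES = [["w1", "w3"], ["w1", "w2", "w3"]]
vocabulary, matrix = binary_vectorize(SENTENCES)

if __name__ == "__main__":
    print(vocabulary)  # ['w1', 'w2', 'w3']
    print(matrix)      # [[1, 0, 1], [1, 1, 1]]
```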


5. CLASSIFICATION ALGORITHM 
5.1. Multinomial naïve Bayes 

The MNB classifier applies Bayes' theorem under the "naive" assumption of conditional independence
between every pair of features given the class [37]. Consider (1):


$$P(\text{class} \mid \text{feature}) = \frac{P(\text{feature} \mid \text{class}) \, P(\text{class})}{P(\text{feature})} \qquad (1)$$
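
A minimal multinomial naive Bayes sketch built on (1) is given below. It is an illustration under assumptions rather than the implementation used in the paper: the toy English tokens and labels are invented, Laplace smoothing with `alpha = 1.0` is an added choice, and the function names are hypothetical. The denominator P(feature) is the same for every class, so it is dropped when comparing classes, and log probabilities are summed to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List

# Toy tokenized reviews with labels (1 = positive, 0 = negative); illustration only.
DOCS = [["good", "film"], ["great", "good", "story"], ["bad", "film"], ["boring", "bad", "story"]]
LABELS = [1, 1, 0, 0]


def train_mnb(docs: List[List[str]], labels: List[int], alpha: float = 1.0):
    """Estimate log P(class) and Laplace-smoothed log P(token | class)."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts: Dict[int, Counter] = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_likelihood


def predict_mnb(doc: List[str], log_prior, log_likelihood) -> int:
    """Return the class with the highest log posterior; tokens outside the vocabulary are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in doc if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    prior, likelihood = train_mnb(DOCS, LABELS)
    print(predict_mnb(["good", "story"], prior, likelihood))
```

With TF-IDF features the integer token counts would be replaced by fractional weights, which the multinomial model also accepts in practice.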


5.2. K-nearest neighbor 
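KNN follows the principle of similarity by calculating the distance (Euclidean distance) between
points. The Euclidean distance between two n-dimensional feature vectors x and y is calculated [37]
as stated in (2):


$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2)$$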
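
A KNN classifier based on Euclidean distance can be sketched as follows. The training vectors, labels, the value `k = 3` and the function names are assumptions chosen for illustration; the paper does not state the value of k that was used. Each query is assigned the majority label among its k nearest training vectors.

```python
import math
from collections import Counter
from typing import List, Sequence

# Toy binary feature vectors with labels (1 = positive, 0 = negative); illustration only.
TRAIN_X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
TRAIN_Y = [1, 1, 1, 0, 0, 0]


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_x: List[Sequence[float]], train_y: List[int],
                query: Sequence[float], k: int = 3) -> int:
    """Label the query by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(euclidean([0, 0], [3, 4]))  # 5.0
    print(knn_predict(TRAIN_X, TRAIN_Y, [1, 0, 1]))
```

An odd k avoids tied votes when there are only two classes.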


Int J Elec & Comp Eng, Vol. 12, No. 1, February 2022: 1030-1039 




6. EVALUATION PARAMETERS 
6.1. Accuracy 

The most intuitive performance measure is accuracy, which is simply the ratio of correctly predicted
observations to total observations [9], as shown in (3):


$$\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{False Positive} + \text{False Negative} + \text{True Negative}} \qquad (3)$$


6.2. Precision 
Precision is the ratio of correctly predicted positive observations to the total number of
observations predicted as positive. A low false positive rate implies high precision [9].


$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \qquad (4)$$


6.3. Recall 

Recall is the ratio of correctly predicted positive observations (true positives) to all
observations that are actually positive [9].


$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \qquad (5)$$


6.4. F1-score 
It is the harmonic mean of recall and precision, which balances the two measures [9].


$$\text{F1 score} = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \qquad (6)$$
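
The four measures in (3) to (6) can be computed from confusion-matrix counts as sketched below. The counts are hypothetical: the paper does not report raw counts, and the values used here (28 true positives, 9 false positives, 0 false negatives, 33 true negatives on 70 test reviews) were chosen only because they reproduce the MNB row of Table 3. The function name and the zero-division guards are assumptions added for illustration.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 as fractions, following equations (3) to (6)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, inferred so that the percentages agree with the MNB row of Table 3.
metrics = classification_metrics(tp=28, fp=9, fn=0, tn=33)
percentages = {name: round(value * 100, 2) for name, value in metrics.items()}

if __name__ == "__main__":
    print(percentages)  # {'accuracy': 87.14, 'precision': 75.68, 'recall': 100.0, 'f1': 86.15}
```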


7. RESULTS AND DISCUSSION 

Figure 3 shows the normalized confusion matrix and accuracy score obtained for MNB with word-level
TF-IDF. Figure 4 shows the normalized confusion matrix and accuracy score obtained for MNB with
word-level CountVectorizer. Figure 5 shows the normalized confusion matrix and accuracy score
obtained for KNN with word-level TF-IDF. Figure 6 shows the normalized confusion matrix and accuracy
score obtained for KNN with word-level CountVectorizer.


[Figure: normalized confusion matrix; axes: true label, predicted label]

Figure 3. Confusion matrix generated for MNB with TF-IDF
[Figure: normalized confusion matrix; axes: true label, predicted label]

Figure 4. Confusion matrix generated for MNB with CountVectorizer


[Figure: normalized confusion matrix; axes: true label, predicted label]

Figure 5. Confusion matrix generated for KNN with TF-IDF


[Figure: normalized confusion matrix; axes: true label, predicted label]

Figure 6. Confusion matrix generated for KNN with CountVectorizer

Tables 3 and 4 compare the MNB and KNN classifiers on the performance evaluation parameters
accuracy, precision, recall, and F1-score. Figures 7 and 8 present the same comparison graphically
for TF-IDF and CountVectorizer features, respectively, and we conclude that the MNB algorithm
generates more accurate results than KNN. As Figure 9 shows, both algorithms perform well with TF-IDF
and CountVectorizer (CV) as feature selection, but MNB matches or exceeds KNN on every performance
parameter.


Table 3. Classification result with word-level TF-IDF
Film review dataset in the Gujarati language

| Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
| --- | --- | --- | --- | --- |
| MNB | 87.14 | 75.68 | 100 | 86.15 |
| KNN | 81.43 | 72.73 | 85.71 | 78.69 |


Table 4. Classification result with word-level CountVectorizer
Film review dataset in the Gujarati language

| Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
| --- | --- | --- | --- | --- |
| MNB | 81.43 | 81.43 | 81.43 | 81.43 |
| KNN | 81.43 | 72.73 | 75.71 | 78.69 |


[Figure: horizontal bar chart comparing KNN and MNB on the performance measures (%); axis from 0 to 120]

Figure 7. MNB and KNN classifier result comparison with TF-IDF

Figure 8. MNB and KNN classifier result comparison with CountVectorizer

[Figure: horizontal bar chart comparing KNN and MNB on performance measures (%) such as accuracy,
precision and recall, for both CV and TF-IDF features; axis from 0 to 120]

Figure 9. MNB and KNN classifier result comparison with CountVectorizer and TF-IDF

8. CONCLUSION


In this paper, a film review dataset is prepared from reviews in the Gujarati language, and two
different machine learning based classification techniques are applied to this dataset with count and
TF-IDF vectorizer features to assess the sentiment of a film review in the Gujarati language. It is
concluded that TF-IDF vectorizer features provide improved results compared to CountVectorizer
features for sentiment analysis. Comparing the results of the two machine learning algorithms on the
accuracy, recall, precision, and F-score performance parameters, we found that the MNB model predicts
sentiment more accurately with TF-IDF features than with CountVectorizer features.